experimental/ssh: show compute provisioning status during ssh connect startup by TanishqDatabricks · Pull Request #5576 · databricks/cli

TanishqDatabricks · 2026-06-12T16:21:56Z

Changes

While the SSH server bootstrap job's compute spins up, the spinner now reads "Waiting for compute to start..." (all connection types) instead of "Starting SSH server...". For GPU accelerators, a persistent notice is printed upfront: "Waiting for GPU_8xH100 compute to be provisioned. This can take upwards of 10 minutes depending on capacity...".
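
A minimal sketch of the behavior described above, not the code from this PR. The message strings come from the description; the names `waitingForCompute` and `gpuNotice`, and the idea of detecting a GPU accelerator by its `GPU_` prefix, are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// waitingForCompute is the spinner text shown while the bootstrap job's
// compute spins up (all connection types).
const waitingForCompute = "Waiting for compute to start..."

// gpuNotice returns the one-time upfront notice for GPU accelerators,
// or an empty string when no notice should be printed.
// ASSUMPTION: a GPU accelerator is recognized by the "GPU_" prefix.
func gpuNotice(accelerator string) string {
	if !strings.HasPrefix(accelerator, "GPU_") {
		return ""
	}
	return fmt.Sprintf(
		"Waiting for %s compute to be provisioned. This can take upwards of 10 minutes depending on capacity...",
		accelerator,
	)
}

func main() {
	// Print the persistent notice once, before the spinner starts.
	if notice := gpuNotice("GPU_8xH100"); notice != "" {
		fmt.Println(notice)
	}
	fmt.Println(waitingForCompute)
}
```

Keeping the notice separate from the spinner text means the notice stays on screen while the spinner line is redrawn.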

Why

`ssh connect --accelerator=GPU_8xH100` frequently fails with:

Error: failed to ensure that ssh server is running: failed to submit and start ssh server job: timed out: waiting for task to start (current state: PENDING)

GPU_8xH100 launch latency is ~10 minutes at P50 and ~30 minutes at P90, so sessions routinely hit the startup timeout even when nothing is wrong. Nothing in the output indicated that compute was being provisioned, so users read the error as a service outage.

Tests

`go build`, `go vet`, and `go test ./experimental/ssh/...` all pass; `TestWaitForJobToStartSurfacesFailure` was updated for the `waitForJobToStart` signature change.
The change is display-only (spinner and notice text); no control flow or error behavior is modified.

This pull request and its description were written by Isaac.

… startup GPU_8xH100 serverless capacity takes ~10 minutes at P50 and ~30 minutes at P90 to acquire, but while waiting `ssh connect` only showed a generic "Starting SSH server... (task: PENDING)" spinner, so users assumed a long wait meant a service outage (see the Zillow report in #remote-development-help). Show "Waiting for compute to start..." while the bootstrap job's compute spins up (all connection types, including dedicated-cluster auto-start), and print an upfront notice for GPU accelerators that provisioning can take upwards of 10 minutes. The startup timeout increase for GPU accelerators is handled separately. Co-authored-by: Isaac

eng-dev-ecosystem-bot · 2026-06-12T17:09:58Z

Integration test report

Commit: 569e075

Run: 27999475012

	Env	🟨KNOWN	✅pass	🙈skip	Time
🟨	aws linux	1	216	99	7:35
🟨	aws windows	1	218	97	2:35
🟨	aws-ucws linux	1	297	18	3:41
🟨	aws-ucws windows	1	299	16	3:24
🟨	azure linux	1	216	98	5:34
🟨	azure windows	1	218	96	2:30
🟨	azure-ucws linux	1	299	15	13:10
🟨	azure-ucws windows	1	301	13	3:17
🟨	gcp linux	1	215	100	6:27
🟨	gcp windows	1	217	98	2:29

	Test Name	aws linux	aws windows	aws-ucws linux	aws-ucws windows	azure linux	azure windows	azure-ucws linux	azure-ucws windows	gcp linux	gcp windows
🟨	TestAccept	🟨K	🟨K	🟨K	🟨K	🟨K	🟨K	🟨K	🟨K	🟨K	🟨K

Top 5 slowest tests (at least 2 minutes):

duration	env	testname
7:01	azure-ucws linux	TestSQLExecScalar
6:54	aws linux	TestSecretsPutSecretStringValue
5:53	gcp linux	TestSecretsPutSecretStringValue
4:52	azure linux	TestSecretsPutSecretStringValue
4:28	azure-ucws linux	TestSecretsPutSecretStringValue

anton-107

Thanks — the diff is clean and the intent is right. Two requested changes on the provisioning notice, both about the wording.

1. Differentiate the message by accelerator type

Right now GPU_1xA10 and GPU_8xH100 get the identical "upwards of 10 minutes" notice, but their provisioning latencies differ a lot — a single A10 is typically acquired much faster than an 8×H100 node. Telling an A10 user to expect 10+ minutes is misleading, and the 8×H100 case arguably warrants a stronger heads-up (P90 ~30 min).

Suggest keying the message off opts.Accelerator — e.g. a small map[string]string of accelerator → expected-time phrasing, with a generic fallback for anything not in the map. That also keeps it correct as new accelerator types are added.

2. Tighten the wording

"upwards of 10 minutes" is a touch informal and slightly misrepresents the data: with P50 ≈ 10 min it implies 10 min is the floor, when in fact roughly half the time it finishes faster — and the real pain is the ~30 min P90 that drove the 45-min timeout in #5569. Anchoring on a range is more useful to someone staring at a long PENDING state. The trailing ... also reads casual for a one-time sentence (vs. the ongoing spinner text, where it fits).

Suggested wording:

GPU_8xH100: Provisioning GPU_8xH100 compute. This typically takes around 10 minutes and can exceed 30 minutes when capacity is constrained.
GPU_1xA10: Provisioning GPU_1xA10 compute. This usually takes a few minutes, longer when capacity is constrained. (adjust to the latency we actually observe)

The matching spinner text can stay short, e.g. Provisioning GPU_8xH100 compute....
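
A minimal sketch of the map-with-fallback suggestion above, using the wording proposed in this comment. The names `provisioningHints`, `fallbackHint`, and `provisioningNotice` are hypothetical, and the fallback sentence is an assumed placeholder, not wording agreed in this review.

```go
package main

import "fmt"

// provisioningHints maps an accelerator type to its expected-time phrasing.
// The two entries use the wording suggested in the review comment.
var provisioningHints = map[string]string{
	"GPU_8xH100": "This typically takes around 10 minutes and can exceed 30 minutes when capacity is constrained.",
	"GPU_1xA10":  "This usually takes a few minutes, longer when capacity is constrained.",
}

// fallbackHint covers accelerator types that are not in the map.
// ASSUMPTION: this sentence is a placeholder chosen for the sketch.
const fallbackHint = "This can take several minutes depending on capacity."

// provisioningNotice builds the one-time notice for the given accelerator.
func provisioningNotice(accelerator string) string {
	hint, ok := provisioningHints[accelerator]
	if !ok {
		hint = fallbackHint
	}
	return fmt.Sprintf("Provisioning %s compute. %s", accelerator, hint)
}

func main() {
	// GPU_FUTURE is a made-up type used only to exercise the fallback.
	for _, acc := range []string{"GPU_8xH100", "GPU_1xA10", "GPU_FUTURE"} {
		fmt.Println(provisioningNotice(acc))
	}
}
```

With the fallback in place, adding a new accelerator type only requires a new map entry when its latency differs enough to deserve its own wording.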

The provisioning heads-up for GPU accelerators was identical for every type and said "upwards of 10 minutes", which is misleading: a single GPU_1xA10 is typically acquired in a few minutes, while a GPU_8xH100 node is ~10 min at P50 and can exceed 30 min at P90. Key the notice off the accelerator type via a small map with a generic fallback, and anchor the wording on a range rather than a floor so it stays useful to someone staring at a long PENDING state. Co-authored-by: Isaac

TanishqDatabricks mentioned this pull request Jun 12, 2026

experimental/ssh: show compute provisioning status during ssh connect startup #5572

Closed

TanishqDatabricks temporarily deployed to test-trigger-is June 12, 2026 16:22 — with GitHub Actions Inactive

TanishqDatabricks requested a review from anton-107 June 12, 2026 16:23

anton-107 requested changes Jun 17, 2026

TanishqDatabricks temporarily deployed to test-trigger-is June 23, 2026 03:06 — with GitHub Actions Inactive

TanishqDatabricks requested a review from anton-107 June 23, 2026 03:07

Update client.go

a77bde9

TanishqDatabricks temporarily deployed to test-trigger-is June 23, 2026 03:08 — with GitHub Actions Inactive

Update client.go

569e075

TanishqDatabricks temporarily deployed to test-trigger-is June 23, 2026 03:10 — with GitHub Actions Inactive

anton-107 approved these changes Jun 23, 2026

TanishqDatabricks added this pull request to the merge queue Jun 23, 2026

TanishqDatabricks removed this pull request from the merge queue due to a manual request Jun 23, 2026

TanishqDatabricks added this pull request to the merge queue Jun 23, 2026

Merged via the queue into main with commit 4418351 Jun 23, 2026
22 checks passed

TanishqDatabricks deleted the ssh-connect-gpu-startup-ux branch June 23, 2026 16:23

TanishqDatabricks merged 4 commits into main from ssh-connect-gpu-startup-ux

3 participants
